Methods
Power Analysis
The original study reported a range of effect sizes across training
conditions. For single-talker training, improvements relative to the
untrained control condition ranged from 10.2% (FAR training with BRP
test talker, p < 0.003) to non-significant negative effects (-4.6% for
TUR training with FAR test talker). Because our study uses 30 trials
rather than the original study’s 60, we adjusted the expected effect
sizes downward by approximately 50% to account for reduced measurement
precision.
# Define effect sizes (adjusted for 30 vs 60 trials)
effect_sizes <- tibble(
effect_type = c("Small", "Medium", "Large", "Bradlow-Low", "Bradlow-High"),
original_d = c(0.30, 0.50, 0.80, 0.60, 1.00),
adjusted_d = c(0.15, 0.25, 0.40, 0.30, 0.50)
)
# Simulated power results for different sample sizes
power_results <- tibble(
n_per_condition = c(100, 150, 200, 250, 300, 350, 400, 450, 500),
Small = c(0.189, 0.252, 0.325, 0.398, 0.431, 0.521, 0.555, 0.604, 0.671),
Medium = c(0.431, 0.584, 0.702, 0.803, 0.866, 0.903, 0.938, 0.959, 0.976),
Large = c(0.802, 0.932, 0.987, 0.993, 0.997, 0.999, 1.000, 1.000, 1.000),
`Bradlow-Low` = c(0.548, 0.747, 0.842, 0.921, 0.950, 0.981, 0.988, 0.994, 0.999),
`Bradlow-High` = c(0.935, 0.991, 0.999, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000)
)
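The simulated power values above were generated in a setup chunk not
reproduced here. As a rough cross-check, base R's closed-form
`power.t.test` (assuming the simulation used an independent-samples
t-test with α = .05, two-sided — an assumption, since the simulation
code is not shown) produces closely matching numbers:

```r
# Analytic cross-check of the simulated table at the planned n = 200 per
# condition. Assumes a two-sample t-test, alpha = .05, two-sided; the
# simulation itself is not reproduced here.
adjusted_d <- c(Small = 0.15, Medium = 0.25, Large = 0.40,
                `Bradlow-Low` = 0.30, `Bradlow-High` = 0.50)
analytic_power <- sapply(adjusted_d, function(d) {
  power.t.test(n = 200, delta = d, sd = 1, sig.level = 0.05)$power
})
round(analytic_power, 3)  # tracks the simulated values within ~1 point
```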
# Display power for n=200 (our target)
target_power <- power_results %>%
filter(n_per_condition == 200) %>%
pivot_longer(cols = -n_per_condition, names_to = "Effect", values_to = "Power") %>%
left_join(effect_sizes %>% select(effect_type, adjusted_d),
by = c("Effect" = "effect_type")) %>%
mutate(Power_pct = sprintf("%.1f%%", Power * 100))
format_table(target_power %>% select(Effect, adjusted_d, Power_pct),
col.names = c("Effect Type", "Cohen's d", "Power"),
caption = "Statistical power with n=200 per condition")
Statistical power with n=200 per condition

| Effect Type  | Cohen's d | Power |
|--------------|-----------|-------|
| Small        | 0.15      | 32.5% |
| Medium       | 0.25      | 70.2% |
| Large        | 0.40      | 98.7% |
| Bradlow-Low  | 0.30      | 84.2% |
| Bradlow-High | 0.50      | 99.9% |
# Create power curve plot
power_long <- power_results %>%
pivot_longer(cols = -n_per_condition, names_to = "Effect", values_to = "Power")
ggplot(power_long, aes(x = n_per_condition, y = Power, color = Effect)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
geom_hline(yintercept = c(0.8, 0.9), linetype = "dashed", alpha = 0.5) +
geom_vline(xintercept = 200, linetype = "dotted", color = "red", linewidth = 1) +
scale_y_continuous(breaks = seq(0, 1, 0.1), limits = c(0, 1)) +
scale_color_viridis_d() +
labs(x = "Sample Size per Condition",
y = "Statistical Power",
title = "Power Analysis for Perceptual Adaptation Effects",
subtitle = "Red line indicates planned sample size (n=200 per condition)") +
theme_minimal() +
theme(legend.position = "bottom") +
annotate("text", x = 210, y = 0.05, label = "n=200",
color = "red", hjust = 0, fontface = "bold")

# Summary statement
cat("\nPower Analysis Summary:
With n=200 participants per condition, we have:
- ", sprintf("%.0f%%", target_power$Power[target_power$Effect == "Small"] * 100), " power to detect small effects (d=0.15)
- ", sprintf("%.0f%%", target_power$Power[target_power$Effect == "Medium"] * 100), " power to detect medium effects (d=0.25)
- ", sprintf("%.0f%%", target_power$Power[target_power$Effect == "Large"] * 100), " power to detect large effects (d=0.40)
- ", sprintf("%.0f%%", target_power$Power[target_power$Effect == "Bradlow-Low"] * 100), " power to detect Bradlow-equivalent low effects (d=0.30)
- ", sprintf("%.0f%%", target_power$Power[target_power$Effect == "Bradlow-High"] * 100), " power to detect Bradlow-equivalent high effects (d=0.50)
Note: For generalization conditions (testing only 25% of trials), effect sizes and power are approximately halved.\n", sep="")
##
## Power Analysis Summary:
## With n=200 participants per condition, we have:
## - 32% power to detect small effects (d=0.15)
## - 70% power to detect medium effects (d=0.25)
## - 99% power to detect large effects (d=0.40)
## - 84% power to detect Bradlow-equivalent low effects (d=0.30)
## - 100% power to detect Bradlow-equivalent high effects (d=0.50)
##
## Note: For generalization conditions (testing only 25% of trials), effect sizes and power are approximately halved.
Planned Sample
Based on our power analysis, we planned to recruit approximately
1,200 L1 English speakers (200 per condition × 6 conditions) for this
study. This sample size provides 84% power to detect medium-to-large
effects (d=0.30, equivalent to the lower range of effects found in
Bradlow et al., 2023).
Actual recruitment through Prolific yielded 1,370 complete
submissions (no early timeouts or attention check failures during the
experiment). However, after applying our preregistered exclusion
criteria during data analysis, 917 valid participants remained. Of
these, 834 (90.9%) were native English speakers who form the primary
analysis sample, with 83 non-native English speakers analyzed separately
for comparison purposes. The native speaker sample provides
approximately 139 participants per condition, which still ensures:
- 72% power for Bradlow-equivalent low effects (d=0.30)
- 54% power for medium effects (d=0.25)
- 91% power for large effects (d=0.40)
Participants were between 18 and 35 years of age, from the US, UK,
and Canada, with no self-reported deficits in speech, language, or
hearing, and with normal or corrected-to-normal vision. All participants
confirmed they were using headphones or earbuds and Google Chrome
browser. Participants were compensated at an average rate of
$10.23/hour, with a median completion time of 8 minutes and 13
seconds.
Materials
While the original study used materials from the ALLSSTAR Corpus, our
replication used sentence recordings from the L2-ARCTIC corpus due to
its accessibility and comprehensive documentation. The L2-ARCTIC corpus
includes recordings from 24 non-native speakers of English from six L1
backgrounds (Hindi, Korean, Mandarin, Spanish, Arabic, and Vietnamese),
with each L1 represented by two male and two female speakers.
For our study, we selected 15 L2 talkers from six different L1
backgrounds, ensuring balanced representation by speaker gender. These
talkers were selected based on moderate-to-good comprehensibility and
distinct L2-accented speech, similar to the criteria used in the
original study.
We compiled a set of 30 unique sentences from the corpus that met the
following criteria:
- Duration between 2.0 and 4.45 seconds
- No proper nouns (to avoid spelling confusion)
- Recorded by all selected speakers
- Free from recording artifacts or quality issues
The sentences were presented as audio only (no visual text) and were
not mixed with noise, differing from the original study’s use of
speech-shaped noise at 0 dB SNR. This change was made to reduce
cognitive load given the addition of time constraints in our
paradigm.
Procedure
Participants listened to sentence recordings over headphones or
earbuds. The sentences were presented one at a time with no possibility
of repetition. Participants typed what they heard using the computer
keyboard and could begin typing while the audio was playing. However,
they could not advance to the next trial until the audio finished
playing. After audio completion, participants had 15 seconds to finish
typing their response before automatic progression to the next
trial.
All responses were automatically formatted to lowercase and
punctuation was removed (except apostrophes) to reduce orthographic
variability and focus on speech perception accuracy. No feedback was
provided during the experiment.
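That normalization can be sketched as a small function; the name
`clean_response` and the exact regex are illustrative, not the
experiment's actual code:

```r
# Sketch of the response normalization described above: lowercase
# everything and strip punctuation except apostrophes.
# clean_response is a hypothetical name, not the jsPsych pipeline's.
clean_response <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", "", x)   # keep letters, apostrophes, and spaces
  trimws(gsub(" +", " ", x))     # collapse repeated whitespace
}
clean_response("He said, 'It's RAINING!'")
```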
Participants were randomly assigned to one of six experimental
conditions:
- Single-single-same: Same speaker throughout all 30
trials
- Single-single-diff-same-variety: One speaker for
training (trials 1-15), different speaker of same L1 background for
testing (trials 16-30)
- Single-single-diff-diff-variety: One speaker for
training, different speaker of different L1 background for testing
- Single-multi-excl-single: Single speaker for
training, multiple speakers (excluding training speaker) for
testing
- Multi-multi-all-random: Random speaker selection
for each trial throughout
- Multi-excl-single-single: Multiple speakers
(excluding one) for training, the excluded speaker for testing
Two attention check trials were inserted at trials 16 and 31,
requiring participants to type a specific word from a clearly
articulated sentence. Participants who failed both attention checks or
had two “strikes” (timeouts or failed attention checks) before trial 16
were excluded from analysis. Note that the main experiment consisted of
30 content trials plus these 2 attention checks for a total of 32
trials.
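A sketch of how that strike rule could be applied during analysis; the
trial-log columns used here (`trial_number`, `strike`) are hypothetical
stand-ins for the actual logged variables:

```r
# Hypothetical sketch of the preregistered strike rule: exclude anyone
# with two or more strikes (timeouts or failed attention checks) before
# trial 16. Column names are illustrative, not the real logged names.
library(dplyr)

trial_log <- tibble(
  participant_id = c("p1", "p1", "p2", "p3"),
  trial_number   = c(3, 12, 5, 20),
  strike         = c(TRUE, TRUE, TRUE, TRUE)  # timeout or failed check
)

excluded_ids <- trial_log %>%
  filter(trial_number < 16, strike) %>%
  count(participant_id, name = "strikes") %>%
  filter(strikes >= 2) %>%
  pull(participant_id)
excluded_ids  # p1 struck out twice before trial 16; p3's strike is later
```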
Exit Survey
After completing all trials, participants completed a brief
demographic survey collecting:
- First language
- Time learning English (if non-native)
- Country where English was learned
- Other languages spoken
- Gender
Differences from Original Study
Key differences from Bradlow et al. (2023):
- Corpus: L2-ARCTIC instead of ALLSSTAR
- Stimuli: 30 sentences vs 60 in original
- Noise: No added noise (original used 0 dB SNR)
- Timing: 15-second response window with time pressure
- Delay: No 11-hour delay between training and testing phases
- Conditions: 6 conditions vs multiple experiments in original
- Response format: Full sentence transcription vs keyword identification
- Platform: Web-based (jsPsych) vs laboratory setting
Analysis Plan
Primary Analyses:
- Mixed-effects logistic regression: Accuracy ~ Condition × Phase + (1|Participant) + (1|Item)
- Character Error Rate (CER) as primary outcome measure, converted to Accuracy (1 - CER) for interpretability
- Planned contrasts testing adaptation benefits relative to baseline (multi-multi condition)
- Effect sizes calculated as Cohen’s d
- Main analyses restricted to native English speakers, with separate comparison of native vs non-native performance
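As an illustration of the CER outcome, edit distance via base R's
`adist` gives a minimal version of the metric; the production scoring
script is not reproduced here, so treat this as a sketch:

```r
# Minimal CER sketch: Levenshtein edit distance between the typed
# response and the reference transcript, normalized by reference length.
# Accuracy is then 1 - CER, as used in the analyses below.
char_error_rate <- function(reference, response) {
  drop(adist(reference, response)) / nchar(reference)
}
cer <- char_error_rate("the cat sat", "the cap sat")  # one substitution
accuracy <- 1 - cer
```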
Secondary Analyses:
- Speaker-level random effects to quantify talker variability
- Learning curves across trials within each phase
- Correlation between training and testing performance
- Impact of L1 background on intelligibility
- Distribution of accuracy scores across all trials
Results
Data Preparation
# Load the data
df_main_all <- read_csv("~/Documents/PerceptualAdaptation/data/df_main.csv", show_col_types = FALSE)
# Convert CER to accuracy
df_main_all <- df_main_all %>%
mutate(accuracy = 1 - cer)
# Create native speaker indicator
df_main_all <- df_main_all %>%
mutate(
is_native_english = grepl("^en", tolower(first_language)),
native_status = ifelse(is_native_english, "Native", "Non-Native")
)
# Check distribution before filtering
native_counts_all <- df_main_all %>%
distinct(participant_id, is_native_english) %>%
count(is_native_english) %>%
mutate(percentage = sprintf("%.1f%%", n/sum(n) * 100))
# FILTER TO KEEP ONLY NATIVE SPEAKERS FOR MAIN ANALYSES
df_main <- df_main_all %>%
filter(is_native_english == TRUE)
# Define condition labels for plotting
condition_labels <- c(
'single-single-same' = 'Same Speaker',
'single-single-diff-same-variety' = 'Different Speaker (Same Variety)',
'single-single-diff-diff-variety' = 'Different Speaker (Diff Variety)',
'single-multi-excl-single' = 'Single→Multi',
'multi-multi-all-random' = 'Multi→Multi',
'multi-excl-single-single' = 'Multi→Single'
)
# Define condition colors
condition_colors <- c(
'single-single-same' = '#2E86AB',
'single-single-diff-same-variety' = '#A23B72',
'single-single-diff-diff-variety' = '#F18F01',
'single-multi-excl-single' = '#C73E1D',
'multi-multi-all-random' = '#6A994E',
'multi-excl-single-single' = '#BC4B51'
)
Data Overview:
- Initial Prolific submissions: 1,370 (no early timeouts or attention check failures)
- After applying exclusion criteria: 917 valid participants
- Exclusion rate: 33.1%

Participant Language Background:
- Native English speakers: 834 (90.9%)
- Non-native English speakers: 83 (9.1%)

After filtering to native speakers only:
- Dataset dimensions: 25,020 rows, 44 columns
- Number of participants: 834
- Number of speakers: 15
- Average participants per condition: 139
Confirmatory Analysis
Primary Visualization 1: Absolute Adaptation Benefit by Condition
(Native Speakers)
# Calculate adaptation benefit for each participant
adaptation_data <- df_main %>%
group_by(condition, participant_id, phase) %>%
summarise(mean_accuracy = mean(accuracy), .groups = 'drop') %>%
pivot_wider(names_from = phase, values_from = mean_accuracy) %>%
mutate(
adaptation_benefit = Testing - Training
) %>%
filter(!is.na(Training) & !is.na(Testing))
# Calculate condition-level statistics
adaptation_summary <- adaptation_data %>%
group_by(condition) %>%
summarise(
mean_benefit = mean(adaptation_benefit),
se_benefit = sd(adaptation_benefit) / sqrt(n()),
n = n(),
raw_benefits = list(adaptation_benefit)
) %>%
mutate(
benefit_pct = mean_benefit * 100,
se_pct = se_benefit * 100
)
# Calculate overall mean
overall_mean <- mean(adaptation_summary$mean_benefit) * 100
# Add condition labels and arrange
adaptation_summary <- adaptation_summary %>%
mutate(condition_label = condition_labels[condition]) %>%
arrange(benefit_pct)
# Create the plot with absolute values
p1 <- ggplot(adaptation_summary, aes(x = reorder(condition_label, benefit_pct),
y = benefit_pct)) +
geom_bar(stat = "identity", aes(fill = benefit_pct),
color = "black", linewidth = 1, alpha = 0.8) +
geom_errorbar(aes(ymin = benefit_pct - se_pct,
ymax = benefit_pct + se_pct),
width = 0.3, linewidth = 1) +
scale_fill_gradient2(low = "#d73027", mid = "#ffffbf", high = "#1a9850",
midpoint = 0, guide = "none") +
geom_hline(yintercept = 0, linetype = "solid", linewidth = 1) +
geom_hline(yintercept = overall_mean, linetype = "dashed", linewidth = 1, color = "blue") +
coord_flip() +
labs(
x = "",
y = "Adaptation Benefit (%)",
title = "Absolute Adaptation Benefit by Condition",
subtitle = sprintf("Blue dashed line shows overall mean (%.2f%%)", overall_mean)
) +
scale_y_continuous(limits = c(-2.5, 3.5)) +
theme_minimal(base_size = 14) +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor = element_blank(),
axis.text = element_text(size = 12),
plot.title = element_text(size = 16, face = "bold")
)
# Add significance tests against zero
for(i in 1:nrow(adaptation_summary)) {
row <- adaptation_summary[i,]
benefits <- unlist(row$raw_benefits)
if(length(benefits) > 1) {
t_test <- t.test(benefits, mu = 0)
if(t_test$p.value < 0.05) {
stars <- ifelse(t_test$p.value < 0.001, "***",
ifelse(t_test$p.value < 0.01, "**", "*"))
y_pos <- row$benefit_pct + sign(row$benefit_pct) * (row$se_pct + 0.2)
p1 <- p1 + annotate("text", x = i, y = y_pos, label = stars,
size = 6, fontface = "bold")
}
}
}
# Add sample sizes
for(i in 1:nrow(adaptation_summary)) {
p1 <- p1 + annotate("text", x = i, y = -2.3,
label = paste0("n=", adaptation_summary$n[i]),
size = 3, color = "gray50")
}
print(p1)

All conditions showed positive adaptation effects (mean = 0.89%),
indicating general perceptual learning across the experiment. However,
the magnitude of adaptation varied considerably by condition:
Highest Adaptation: Multi→Single condition (1.89%) -
Training with multiple speakers prepared listeners exceptionally well
for a single novel speaker, suggesting that varied input creates robust
and flexible representations.
Talker-Specific Benefit: Same Speaker condition
(1.49%) - Continued exposure to the same speaker yielded
substantial gains, supporting talker-specific adaptation mechanisms.
Moderate Adaptation: Same-Variety (0.79%) and Multi→Multi
(0.64%) - These conditions showed reliable but modest
improvements.
Minimal Adaptation: Single→Multi (0.43%) and
Different-Variety (0.10%) - Limited improvement suggests
difficulty generalizing from single-speaker training to multiple
speakers, and surprisingly little benefit from matched L1
backgrounds.
Note: *, **, *** indicate p < .05, .01, .001 respectively
(test against zero)
Primary Visualization 2: Accuracy by Single-Speaker Conditions
(Native Speakers)
# Focus on single-speaker conditions for H1
h1_conditions <- c('single-single-same', 'single-single-diff-same-variety',
'single-single-diff-diff-variety')
# Get testing phase trial-level data for these conditions (not aggregated)
h1_data_trials <- df_main %>%
filter(condition %in% h1_conditions & phase == "Testing") %>%
mutate(condition_label = condition_labels[condition])
# Calculate summary statistics for error bars
h1_summary <- h1_data_trials %>%
group_by(condition, condition_label) %>%
summarise(
mean = mean(accuracy),
se = sd(accuracy) / sqrt(n()),
n = n(),
.groups = 'drop'
)
# Get participant-level data for pairwise tests
h1_data <- df_main %>%
filter(condition %in% h1_conditions & phase == "Testing") %>%
group_by(condition, participant_id) %>%
summarise(mean_accuracy = mean(accuracy), .groups = 'drop')
# Perform pairwise t-tests
comparisons <- list(
c("single-single-same", "single-single-diff-same-variety"),
c("single-single-same", "single-single-diff-diff-variety"),
c("single-single-diff-same-variety", "single-single-diff-diff-variety")
)
p_values <- map_dbl(comparisons, function(comp) {
data1 <- h1_data %>% filter(condition == comp[1]) %>% pull(mean_accuracy)
data2 <- h1_data %>% filter(condition == comp[2]) %>% pull(mean_accuracy)
t.test(data1, data2)$p.value
})
# Create violin plot with all trial-level data
p2 <- ggplot(h1_data_trials, aes(x = condition_label, y = accuracy)) +
geom_violin(aes(fill = condition), alpha = 0.7, scale = "width") +
geom_jitter(alpha = 0.1, size = 0.8, width = 0.2, color = "gray30") +
# Add mean and error bars
geom_point(data = h1_summary, aes(x = condition_label, y = mean),
size = 4, color = "black") +
geom_errorbar(data = h1_summary,
aes(x = condition_label, y = mean, ymin = mean - se, ymax = mean + se),
width = 0.2, linewidth = 1, color = "black") +
scale_fill_manual(values = condition_colors[h1_conditions], guide = "none") +
labs(
x = NULL,
y = "Testing Phase Accuracy",
title = "Talker-Specific Adaptation",
subtitle = "Testing phase performance by training-test speaker relationship (each point = one trial)"
) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1),
limits = c(0, 1),
breaks = seq(0, 1, 0.1)) +
theme_minimal(base_size = 14) +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_line(color = "gray90", linewidth = 0.5),
axis.text.x = element_text(size = 12, color = "gray20"),
axis.text.y = element_text(size = 11, color = "gray20"),
axis.title.y = element_text(size = 13, margin = margin(r = 10)),
plot.title = element_text(size = 18, face = "bold", color = "gray10"),
plot.subtitle = element_text(size = 13, color = "gray40", margin = margin(b = 15)),
plot.background = element_rect(fill = "white", color = NA),
panel.background = element_rect(fill = "white", color = NA)
)
print(p2)
## Warning: Removed 1263 rows containing missing values or values outside the scale range
## (`geom_point()`).

This analysis tests whether training with a specific speaker provides
advantages when tested with that same speaker versus novel speakers:
Same Speaker Advantage: The Same Speaker condition
achieved 89.9% accuracy, significantly outperforming the Same-Variety
condition (86.3%, p = .001). This 3.6 percentage point advantage
demonstrates robust talker-specific perceptual tuning.
L1 Variety Effects: Surprisingly, there was no
significant difference between Same Speaker and Different-Variety
conditions (88.5%, p = .114), suggesting that L1 background may be less
important than expected.
Variety Comparison (Testing H2): The Same-Variety
condition performed significantly worse than the Different-Variety
condition (p = .038). This directly tests H2 (variety-general
adaptation) and shows the opposite of the predicted pattern: shared L1
background hindered rather than facilitated cross-talker
generalization.
These results provide partial support for H1
(talker-specific adaptation exists but only relative to same-variety
conditions) and evidence against H2 (L1 variety does
not facilitate generalization as predicted).
Testing H2: Variety-General Adaptation
# Focus on conditions that test variety effects
h2_conditions <- c('single-single-diff-same-variety', 'single-single-diff-diff-variety')
# Get data for both phases to examine adaptation patterns
h2_data <- df_main %>%
filter(condition %in% h2_conditions) %>%
group_by(condition, participant_id, phase) %>%
summarise(mean_accuracy = mean(accuracy), .groups = 'drop') %>%
mutate(condition_label = condition_labels[condition])
# Get participant counts for legend
n_per_condition <- h2_data %>%
distinct(condition, participant_id) %>%
count(condition) %>%
mutate(condition_label = condition_labels[condition])
# Calculate phase means for plotting
h2_summary <- h2_data %>%
group_by(condition_label, phase) %>%
summarise(
mean = mean(mean_accuracy),
se = sd(mean_accuracy) / sqrt(n()),
n = n(),
.groups = 'drop'
) %>%
mutate(phase = factor(phase, levels = c("Training", "Testing")))
# Create interaction plot with n in legend
p_h2 <- ggplot(h2_summary, aes(x = phase, y = mean, color = condition_label, group = condition_label)) +
geom_line(linewidth = 2) +
geom_point(size = 4) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
width = 0.1, linewidth = 1) +
scale_color_manual(
values = c(
"Different Speaker (Same Variety)" = "#A23B72",
"Different Speaker (Diff Variety)" = "#F18F01"
),
labels = c(
"Different Speaker (Same Variety)" = paste0("Different Speaker (Same Variety) (n = ",
n_per_condition$n[n_per_condition$condition_label == "Different Speaker (Same Variety)"], ")"),
"Different Speaker (Diff Variety)" = paste0("Different Speaker (Diff Variety) (n = ",
n_per_condition$n[n_per_condition$condition_label == "Different Speaker (Diff Variety)"], ")")
),
name = "Condition"
) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1),
limits = c(0.84, 0.90)) +
labs(
x = "Phase",
y = "Mean Accuracy",
title = "Testing H2: Variety-General Adaptation",
subtitle = "Does shared L1 background facilitate cross-talker generalization?"
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "top",
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "gray90"),
axis.text = element_text(size = 12),
axis.title = element_text(size = 13),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 13, color = "gray40")
)
print(p_h2)

# Calculate adaptation benefits for each condition
h2_adaptation <- h2_data %>%
pivot_wider(names_from = phase, values_from = mean_accuracy) %>%
mutate(adaptation = Testing - Training) %>%
group_by(condition_label) %>%
summarise(
mean_adaptation = mean(adaptation, na.rm = TRUE),
se_adaptation = sd(adaptation, na.rm = TRUE) / sqrt(n()),
n = n()
)
# Test difference in adaptation between conditions
same_variety_adapt <- h2_data %>%
  filter(condition == "single-single-diff-same-variety") %>%
  pivot_wider(names_from = phase, values_from = mean_accuracy) %>%
  mutate(adaptation = Testing - Training) %>%  # pull() takes a column, not an expression
  pull(adaptation)
diff_variety_adapt <- h2_data %>%
  filter(condition == "single-single-diff-diff-variety") %>%
  pivot_wider(names_from = phase, values_from = mean_accuracy) %>%
  mutate(adaptation = Testing - Training) %>%
  pull(adaptation)
adapt_test <- t.test(same_variety_adapt, diff_variety_adapt)
H2 Analysis Summary:
The variety-general adaptation hypothesis (H2) predicts that training
on speakers from one L1 background should facilitate better
generalization to new speakers from the same L1 background compared to
speakers from different L1 backgrounds.
Results contradict H2:
- Same-Variety condition: Training 85.5% → Testing 86.3% (adaptation: +0.8%)
- Different-Variety condition: Training 88.4% → Testing 88.5% (adaptation: +0.1%)
- Testing phase comparison: Different-Variety (88.5%) > Same-Variety (86.3%), p = .038
The Different-Variety condition maintained higher accuracy throughout
and showed less need for adaptation. The significant difference in
testing phase performance (p = .038) runs counter to H2’s prediction,
suggesting that L1 background matching does not facilitate cross-talker
generalization and may even hinder it.
Mixed-Effects Model Analysis (Native Speakers)
# Check if nlme is available for mixed models
has_nlme <- requireNamespace("nlme", quietly = TRUE)
if(has_nlme) {
# Prepare data for mixed model
model_data <- df_main %>%
mutate(
condition = factor(condition),
phase = factor(phase),
participant_id = factor(participant_id),
stimulus_id = factor(stimulus_id),
speaker_id = factor(speaker_id),
trial_in_phase = ifelse(phase == "Training", overall_trial_number, overall_trial_number - 15)
)
# Fit mixed-effects model using nlme
library(nlme)
model <- lme(accuracy ~ condition * phase,
random = ~ 1 | participant_id,
data = model_data)
# DETAILED MODEL OUTPUT
cat("=== DETAILED MIXED EFFECTS MODEL OUTPUT ===\n\n")
# Full model summary
model_summary <- summary(model)
# Extract and format fixed effects
cat("FIXED EFFECTS:\n")
cat("─────────────────────────────────────────────────────────────────────────\n")
fixed_effects <- model_summary$tTable
# Format the output with interpretable names
effect_names <- rownames(fixed_effects)
for(i in 1:nrow(fixed_effects)) {
effect_name <- effect_names[i]
coef <- fixed_effects[i, "Value"]
se <- fixed_effects[i, "Std.Error"]
df <- fixed_effects[i, "DF"]
t_val <- fixed_effects[i, "t-value"]
p_val <- fixed_effects[i, "p-value"]
# Add significance stars
sig_stars <- ifelse(p_val < 0.001, "***",
ifelse(p_val < 0.01, "**",
ifelse(p_val < 0.05, "*", "")))
cat(sprintf("%-50s β = %7.4f (SE = %.4f), t(%d) = %6.2f, p = %.4f %s\n",
effect_name, coef, se, df, t_val, p_val, sig_stars))
}
cat("\n")
# Extract variance components
var_comp <- VarCorr(model)
participant_var <- as.numeric(var_comp[1,1])
residual_var <- as.numeric(var_comp[2,1])
total_var <- participant_var + residual_var
cat("\nVARIANCE COMPONENTS:\n")
cat("─────────────────────────────────────────────────────────────────────────\n")
cat(sprintf("Participant (Random Intercept): σ² = %.6f (SD = %.4f)\n",
participant_var, sqrt(participant_var)))
cat(sprintf("Residual: σ² = %.6f (SD = %.4f)\n",
residual_var, sqrt(residual_var)))
cat(sprintf("Total: σ² = %.6f\n", total_var))
cat(sprintf("\nIntraclass Correlation (ICC): %.3f\n", participant_var / total_var))
cat(sprintf(" → %.1f%% of variance is between participants\n", 100 * participant_var / total_var))
cat(sprintf(" → %.1f%% of variance is within participants\n", 100 * residual_var / total_var))
# Model fit statistics
cat("\nMODEL FIT:\n")
cat("─────────────────────────────────────────────────────────────────────────\n")
cat(sprintf("Log-Likelihood: %.2f\n", model_summary$logLik))
cat(sprintf("AIC: %.1f\n", AIC(model)))
cat(sprintf("BIC: %.1f\n", BIC(model)))
cat(sprintf("Number of observations: %d\n", nrow(model_data)))
cat(sprintf("Number of participants: %d\n", length(unique(model_data$participant_id))))
# Calculate R² for each condition
cat("\n\nMODEL PREDICTIONS BY CONDITION:\n")
cat("─────────────────────────────────────────────────────────────────────────\n")
# Calculate R² by condition
r2_by_condition <- model_data %>%
mutate(predicted = fitted(model)) %>%
group_by(condition) %>%
summarise(
r2 = cor(accuracy, predicted)^2,
rmse = sqrt(mean((accuracy - predicted)^2)),
n = n()
) %>%
mutate(condition_label = condition_labels[condition])
for(i in 1:nrow(r2_by_condition)) {
cat(sprintf("%-40s R² = %.3f, RMSE = %.4f (n = %d)\n",
r2_by_condition$condition_label[i],
r2_by_condition$r2[i],
r2_by_condition$rmse[i],
r2_by_condition$n[i]))
}
# Overall R²
overall_r2 <- cor(model_data$accuracy, fitted(model))^2
cat(sprintf("\nOverall R²: %.3f\n", overall_r2))
# Model predictions vs actual plot with R² annotations
model_data$predicted <- fitted(model)
# Calculate R² for each condition for the plot
r2_data <- model_data %>%
group_by(condition) %>%
summarise(r2 = cor(accuracy, predicted)^2) %>%
mutate(
condition_label = condition_labels[condition],
r2_label = sprintf("R² = %.3f", r2)
)
# Sample data for plotting
plot_data <- model_data %>%
group_by(participant_id) %>%
mutate(participant_num = cur_group_id()) %>%
ungroup() %>%
filter(participant_num %% 3 == 0) %>%
mutate(condition_label = condition_labels[condition])
# Create prediction plot with R² values
pred_plot <- ggplot(plot_data, aes(x = accuracy, y = predicted)) +
geom_point(alpha = 0.3, size = 1.5, color = "#2E86AB") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red", linewidth = 1) +
geom_text(data = r2_data, aes(x = 0.4, y = 0.95, label = r2_label),
hjust = 0, vjust = 1, size = 4, fontface = "bold", color = "red") +
facet_wrap(~ condition_label, nrow = 2) +
scale_x_continuous(labels = scales::percent_format(accuracy = 1),
limits = c(0.3, 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1),
limits = c(0.3, 1)) +
labs(
x = "Actual Accuracy",
y = "Predicted Accuracy",
title = "Mixed Effects Model: Predicted vs Actual Accuracy",
subtitle = "Red dashed line represents perfect prediction (Native speakers only)"
) +
theme_minimal(base_size = 12) +
theme(
strip.text = element_text(face = "bold", size = 10),
panel.spacing = unit(1, "lines"),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "gray95"),
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 12, color = "gray40"),
aspect.ratio = 1
)
print(pred_plot)
} else {
cat("Note: nlme package not available. Mixed-effects analysis skipped.\n")
cat("To run this analysis, install the package with:\n")
cat("install.packages('nlme')\n")
}
## === DETAILED MIXED EFFECTS MODEL OUTPUT ===
##
## FIXED EFFECTS:
## ─────────────────────────────────────────────────────────────────────────
## (Intercept) β = 0.8839 (SE = 0.0070), t(24180) = 126.67, p = 0.0000 ***
## conditionmulti-multi-all-random β = -0.0238 (SE = 0.0102), t(828) = -2.33, p = 0.0198 *
## conditionsingle-multi-excl-single β = -0.0151 (SE = 0.0100), t(828) = -1.50, p = 0.1328
## conditionsingle-single-diff-diff-variety β = 0.0014 (SE = 0.0099), t(828) = 0.14, p = 0.8916
## conditionsingle-single-diff-same-variety β = -0.0207 (SE = 0.0101), t(828) = -2.04, p = 0.0419 *
## conditionsingle-single-same β = 0.0156 (SE = 0.0101), t(828) = 1.54, p = 0.1247
## phaseTraining β = -0.0189 (SE = 0.0045), t(24180) = -4.22, p = 0.0000 ***
## conditionmulti-multi-all-random:phaseTraining β = 0.0126 (SE = 0.0065), t(24180) = 1.92, p = 0.0549
## conditionsingle-multi-excl-single:phaseTraining β = 0.0147 (SE = 0.0064), t(24180) = 2.28, p = 0.0229 *
## conditionsingle-single-diff-diff-variety:phaseTraining β = 0.0179 (SE = 0.0064), t(24180) = 2.82, p = 0.0048 **
## conditionsingle-single-diff-same-variety:phaseTraining β = 0.0110 (SE = 0.0065), t(24180) = 1.70, p = 0.0899
## conditionsingle-single-same:phaseTraining β = 0.0040 (SE = 0.0065), t(24180) = 0.62, p = 0.5369
##
##
## VARIANCE COMPONENTS:
## ─────────────────────────────────────────────────────────────────────────
## Participant (Random Intercept): σ² = 0.005759 (SD = 0.0759)
## Residual: σ² = 0.022420 (SD = 0.1497)
## Total: σ² = 0.028179
##
## Intraclass Correlation (ICC): 0.204
## → 20.4% of variance is between participants
## → 79.6% of variance is within participants
##
## MODEL FIT:
## ─────────────────────────────────────────────────────────────────────────
## Log-Likelihood: 11061.29
## AIC: -22094.6
## BIC: -21980.8
## Number of observations: 25020
## Number of participants: 834
##
##
## MODEL PREDICTIONS BY CONDITION:
## ─────────────────────────────────────────────────────────────────────────
## Same Speaker R² = 0.176, RMSE = 0.1483 (n = 4470)
## Different Speaker (Same Variety) R² = 0.208, RMSE = 0.1574 (n = 3930)
## Different Speaker (Diff Variety) R² = 0.328, RMSE = 0.1477 (n = 4170)
## Single→Multi R² = 0.147, RMSE = 0.1424 (n = 4410)
## Multi→Multi R² = 0.254, RMSE = 0.1572 (n = 4020)
## Multi→Single R² = 0.249, RMSE = 0.1309 (n = 4020)
##
## Overall R²: 0.235
## Warning: Removed 86 rows containing missing values or values outside the scale range
## (`geom_point()`).

Model Interpretation
The mixed-effects model reveals several key findings:
Baseline Performance: The intercept indicates that
the reference condition (multi-excl-single-single in Testing phase)
achieved 88.4% accuracy.
Phase Effect: Accuracy was significantly lower in the
Training phase than in the Testing phase (-1.9 percentage points, p < .001).
This negative Training coefficient indicates that participants improved from
training to testing, as expected when perceptual learning occurs.
Condition Differences: The Multi→Multi condition
performed significantly worse than the reference (-2.4 percentage points,
p = .02), and the Same-Variety condition also showed lower performance
(-2.1 percentage points, p = .04).
Variance Structure: The ICC of 0.204 indicates
substantial individual differences, with 20.4% of variance attributable
to between-participant differences and 79.6% to within-participant
variation (trial-to-trial variability).
Model Fit: The overall R² of 0.235 suggests the
model explains about 24% of the variance in accuracy. R² values varied
considerably by condition, from 0.147 (Single→Multi) to 0.328 (Different
Speaker, Diff Variety), indicating that the model’s predictive accuracy
differs across experimental conditions.
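As a sanity check on the variance structure reported above, the ICC can be recomputed directly from the printed variance components (values copied by hand from the model summary; this is illustrative arithmetic, not part of the fitted pipeline):

```r
# ICC = between-participant variance / total variance
var_participant <- 0.005759  # random-intercept variance (participants)
var_residual    <- 0.022420  # residual (trial-to-trial) variance
icc <- var_participant / (var_participant + var_residual)
round(icc, 3)  # 0.204, matching the reported ICC
```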
Learning Curves
# Calculate trial-by-trial accuracy for each condition (native speakers only)
trial_data <- df_main %>%
group_by(condition, overall_trial_number) %>%
summarise(
mean_accuracy = mean(accuracy),
se = sd(accuracy) / sqrt(n()),
n = n(),
.groups = 'drop'
) %>%
mutate(
condition_label = condition_labels[condition],
phase = ifelse(overall_trial_number <= 15, "Training", "Testing")
)
# Create enhanced faceted plot
p3 <- ggplot(trial_data, aes(x = overall_trial_number, y = mean_accuracy)) +
geom_ribbon(aes(ymin = mean_accuracy - se, ymax = mean_accuracy + se,
fill = condition), alpha = 0.2) +
geom_line(aes(color = condition), linewidth = 1.2) +
geom_point(aes(color = condition), size = 1.8, alpha = 0.8) +
geom_vline(xintercept = 15.5, linetype = "dashed", alpha = 0.4, linewidth = 0.8) +
facet_wrap(~ condition_label, nrow = 2) +
scale_color_manual(values = condition_colors, guide = "none") +
scale_fill_manual(values = condition_colors, guide = "none") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1),
limits = c(0.7, 1),
breaks = seq(0.7, 1, 0.05)) +
labs(
x = "Trial Number",
y = "Accuracy",
title = "Learning Curves by Condition",
subtitle = "Vertical line indicates training-testing phase transition"
) +
theme_minimal(base_size = 12) +
theme(
strip.text = element_text(face = "bold", size = 11, margin = margin(b = 5)),
strip.background = element_rect(fill = "gray97", color = NA),
panel.spacing = unit(1.2, "lines"),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "gray95", linewidth = 0.5),
axis.text = element_text(size = 10, color = "gray20"),
axis.title = element_text(size = 12),
plot.title = element_text(size = 18, face = "bold", color = "gray10"),
plot.subtitle = element_text(size = 13, color = "gray40", margin = margin(b = 15)),
plot.background = element_rect(fill = "white", color = NA),
panel.background = element_rect(fill = "white", color = NA)
)
print(p3)

Native vs Non-Native Speaker Comparison
# Using the unfiltered data for native vs non-native comparison
# Calculate overall performance by native status and phase
overall_comparison <- df_main_all %>%
group_by(native_status, phase) %>%
summarise(
mean_accuracy = mean(accuracy),
se_accuracy = sd(accuracy) / sqrt(n()),
n = n(),
.groups = 'drop'
) %>%
mutate(
phase = factor(phase, levels = c("Training", "Testing")) # Ensure correct order
)
# Get participant counts for legend
n_native <- n_distinct(df_main_all %>% filter(native_status == "Native") %>% pull(participant_id))
n_nonnative <- n_distinct(df_main_all %>% filter(native_status == "Non-Native") %>% pull(participant_id))
# Store values for interpretation
native_train <- overall_comparison %>%
filter(native_status == "Native" & phase == "Training") %>%
pull(mean_accuracy) * 100
native_test <- overall_comparison %>%
filter(native_status == "Native" & phase == "Testing") %>%
pull(mean_accuracy) * 100
nonnative_train <- overall_comparison %>%
filter(native_status == "Non-Native" & phase == "Training") %>%
pull(mean_accuracy) * 100
nonnative_test <- overall_comparison %>%
filter(native_status == "Non-Native" & phase == "Testing") %>%
pull(mean_accuracy) * 100
# Test for statistical difference
native_data <- df_main_all %>% filter(native_status == "Native")
nonnative_data <- df_main_all %>% filter(native_status == "Non-Native")
# Store t-test results
train_t <- t.test(
native_data %>% filter(phase == "Training") %>% pull(accuracy),
nonnative_data %>% filter(phase == "Training") %>% pull(accuracy)
)
test_t <- t.test(
native_data %>% filter(phase == "Testing") %>% pull(accuracy),
nonnative_data %>% filter(phase == "Testing") %>% pull(accuracy)
)
# Create comparison plot with updated legend
p_comparison <- ggplot(overall_comparison, aes(x = phase, y = mean_accuracy,
color = native_status, group = native_status)) +
geom_line(linewidth = 2) +
geom_point(size = 4) +
geom_errorbar(aes(ymin = mean_accuracy - se_accuracy,
ymax = mean_accuracy + se_accuracy),
width = 0.1, linewidth = 1) +
scale_color_manual(
values = c("Native" = "#2E86AB", "Non-Native" = "#F18F01"),
labels = c(
"Native" = paste0("Native (n = ", n_native, ")"),
"Non-Native" = paste0("Non-Native (n = ", n_nonnative, ")")
),
name = "Speaker Status"
) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1),
limits = c(0.8, 0.9)) +
labs(
x = "Phase",
y = "Mean Accuracy",
title = "Overall Performance: Native vs Non-Native English Speakers",
subtitle = "Error bars represent standard error"
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "top",
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "gray90"),
axis.text = element_text(size = 12),
axis.title = element_text(size = 13),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 13, color = "gray40")
)
print(p_comparison)

The comparison between native and non-native English speakers reveals
striking differences in both performance levels and adaptation
patterns:
Baseline Performance: Native speakers significantly
outperformed non-native speakers in both phases. Training: 86.8% vs.
84.0% (difference = 2.8 percentage points, t = 5.16, p < .001). Testing:
87.7% vs. 83.5% (difference = 4.2 percentage points, t = 7.54, p < .001).
Adaptation Patterns: While native speakers improved
from training to testing (+0.9 percentage points), non-native speakers
declined slightly (-0.5 percentage points). This divergent pattern may
reflect differences in how the two groups process and adapt to
L2-accented speech.
Implications: The performance gap widened from
training to testing, indicating that the experimental manipulation may
have been more challenging for non-native speakers, possibly due to
increased cognitive load or less flexible perceptual adaptation
mechanisms.
Distribution of Trial Accuracies
# Create histogram of all trial accuracies (native speakers only)
p_hist <- ggplot(df_main, aes(x = accuracy)) +
geom_histogram(binwidth = 0.05, fill = "#2E86AB", alpha = 0.8,
color = "white", boundary = 0) +
scale_x_continuous(labels = scales::percent_format(accuracy = 1),
breaks = seq(0, 1, 0.1)) +
scale_y_continuous(expand = c(0, 0)) +
labs(
x = "Accuracy",
y = "Number of Trials",
title = "Distribution of Trial Accuracies",
subtitle = sprintf("All trials from native English speakers (n = %d trials)", nrow(df_main))
) +
theme_minimal(base_size = 14) +
theme(
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(color = "gray90"),
axis.text = element_text(size = 11),
axis.title = element_text(size = 13),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 13, color = "gray40")
)
print(p_hist)

# Calculate and store statistics
acc_mean <- mean(df_main$accuracy) * 100
acc_median <- median(df_main$accuracy) * 100
acc_sd <- sd(df_main$accuracy) * 100
acc_min <- min(df_main$accuracy) * 100
acc_max <- max(df_main$accuracy) * 100
perfect_trials <- sum(df_main$accuracy == 1)
perfect_pct <- 100 * perfect_trials / nrow(df_main)
The distribution of trial accuracies reveals several important
characteristics of L2 speech perception performance:
Central Tendency: The mean accuracy of 87.3% with a
median of 93.3% indicates a left-skewed (negatively skewed) distribution:
most trials were highly accurate, while a subset of particularly
challenging trials pulled the mean below the median.
Variability: The standard deviation of 16.8%
demonstrates substantial trial-to-trial variability, reflecting the
diverse challenges posed by different speakers, sentences, and
experimental conditions.
Ceiling Effects: Remarkably, 39.7% of trials (9,940
trials) achieved perfect accuracy, suggesting that many L2-accented
sentences were fully intelligible to native English listeners despite
the accent.
Range: Accuracy ranged from 0.0% to 100.0%; the
zero-accuracy trials represent complete transcription failures for
certain speaker-sentence combinations.
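Two of the distribution figures above can be verified with simple arithmetic (values copied from the summary statistics; a check, not a new analysis):

```r
# Mean (87.3%) below median (93.3%): with accuracy capped at 100%, the
# hard-trial tail pulls the mean down while most trials sit near ceiling
acc_mean   <- 0.873
acc_median <- 0.933
stopifnot(acc_mean < acc_median)

# Share of perfect trials: 9,940 of 25,020
round(100 * 9940 / 25020, 1)  # 39.7
```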
Speaker Effects
# Calculate speaker-level statistics (based on native speaker responses)
speaker_stats <- df_main %>%
group_by(speaker_id) %>%
summarise(
mean_accuracy = mean(accuracy),
se = sd(accuracy) / sqrt(n()),
n = n()
) %>%
arrange(desc(mean_accuracy))
# Create enhanced speaker barplot
p4 <- ggplot(speaker_stats, aes(x = reorder(speaker_id, mean_accuracy), y = mean_accuracy)) +
geom_bar(stat = "identity", aes(fill = mean_accuracy), alpha = 0.85, width = 0.8) +
geom_errorbar(aes(ymin = mean_accuracy - se, ymax = mean_accuracy + se),
width = 0.2, linewidth = 0.6, color = "gray30") +
geom_hline(yintercept = mean(df_main$accuracy),
linetype = "dashed", color = "#E63946", linewidth = 1) +
scale_fill_gradient2(low = "#2E86AB", mid = "#F77F00", high = "#06D6A0",
midpoint = mean(df_main$accuracy),
guide = "none") +
coord_flip() +
scale_y_continuous(labels = scales::percent_format(accuracy = 1),
breaks = seq(0.7, 1, 0.05)) +
labs(
x = NULL,
y = "Mean Accuracy",
title = "Speaker Intelligibility Ranking",
subtitle = "Red dashed line indicates grand mean across all speakers"
) +
theme_minimal(base_size = 12) +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_line(color = "gray90", linewidth = 0.5),
axis.text.y = element_text(size = 10, color = "gray20"),
axis.text.x = element_text(size = 10, color = "gray20"),
axis.title.x = element_text(size = 12, margin = margin(t = 10)),
plot.title = element_text(size = 16, face = "bold", color = "gray10"),
plot.subtitle = element_text(size = 12, color = "gray40", margin = margin(b = 10)),
plot.background = element_rect(fill = "white", color = NA),
panel.background = element_rect(fill = "white", color = NA)
)
print(p4)

# Calculate statistics
n_speakers <- nrow(speaker_stats)
min_acc <- min(speaker_stats$mean_accuracy) * 100
max_acc <- max(speaker_stats$mean_accuracy) * 100
speaker_var <- var(speaker_stats$mean_accuracy)
The analysis of 15 L2 speakers reveals substantial individual
differences in intelligibility:
Range: Speaker intelligibility varied from 74.7% to
93.1%, representing an 18.4 percentage point spread. This wide range
underscores the importance of speaker selection in L2 speech perception
research.
Variance: The speaker variance of 0.0025 indicates
that individual speaker characteristics contribute substantially to
overall performance variability, beyond the effects of experimental
condition.
Implications: These speaker effects suggest that
perceptual adaptation to L2 speech may be heavily influenced by the
specific acoustic-phonetic characteristics of individual talkers, rather
than general properties of L1-influenced speech patterns.
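The reported speaker variance can be translated into a more interpretable spread (values copied from the speaker statistics above):

```r
speaker_var <- 0.0025                 # variance of speaker mean accuracies
speaker_sd  <- sqrt(speaker_var)      # 0.05: speakers differ by ~5 pp per SD
range_pp    <- (0.931 - 0.747) * 100  # 18.4 pp spread, best to worst talker
round(c(speaker_sd, range_pp), 2)
```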
Summary of Key Findings
# Overall statistics
overall_stats <- df_main %>%
group_by(phase) %>%
summarise(
mean_accuracy = mean(accuracy),
sd_accuracy = sd(accuracy),
n = n()
)
# Store key values
n_participants <- n_distinct(df_main$participant_id)
n_excluded <- n_distinct(df_main_all$participant_id) - n_distinct(df_main$participant_id)
n_trials <- nrow(df_main)
overall_acc <- mean(df_main$accuracy) * 100
overall_sd <- sd(df_main$accuracy) * 100
train_acc <- overall_stats$mean_accuracy[overall_stats$phase == "Training"] * 100
test_acc <- overall_stats$mean_accuracy[overall_stats$phase == "Testing"] * 100
# Adaptation cost analysis
costs <- adaptation_summary %>%
filter(condition %in% c("single-multi-excl-single", "multi-excl-single-single")) %>%
pull(mean_benefit)
# Note: p_values[3] contains the Same-Variety vs Different-Variety comparison (H2 test)
Experiment Summary
Sample Characteristics:
- Analyzed: 834 native English speakers
- Excluded: 83 non-native speakers (separate analysis)
- Total trials: 25,020 (native speakers only)
- Conditions: 6 experimental conditions (~139 participants each)
Overall Performance:
- Grand mean accuracy: 87.3% (SD = 16.8%)
- Training phase: 86.8%
- Testing phase: 87.7%
- Overall adaptation benefit: +0.9 percentage points
Hypothesis Tests:
✓ H1 PARTIALLY SUPPORTED: Talker-specific adaptation found
- Evidence: Same Speaker condition outperformed Same-Variety condition
- Same vs. Same-Variety: p = .001
- Same vs. Diff-Variety: p = .114 (not significant)
✗ H2 NOT SUPPORTED: Variety-general adaptation not found
- Different-Variety (88.5%) outperformed Same-Variety (86.3%), p = .038
- This is opposite to the predicted direction
- Shared L1 background hindered rather than facilitated generalization
H3 NOT CLEARLY SUPPORTED: Specialization patterns unclear
- Single→Multi adaptation: 0.43% (modest benefit, not cost)
- Multi→Single adaptation: 1.89% (strong benefit)
- Both conditions showed benefits rather than the expected cost-benefit tradeoff
Key Insights:
1. All conditions showed positive adaptation effects
2. Multi→Single training showed the strongest benefits (1.89%)
3. Evidence against variety-general adaptation (H2): different L1 > same L1
4. Substantial individual differences (ICC = 0.204)
5. Wide speaker intelligibility range (74.7% – 93.1%)
6. Native speakers significantly outperformed non-native speakers
7. Nearly 40% of trials achieved perfect accuracy
Discussion
Summary of Replication Attempt
This partial replication of Bradlow et al. (2023) examined perceptual
adaptation to L2-accented speech across six experimental conditions
manipulating speaker variability during training and testing phases.
From 1,370 complete Prolific submissions, 917 participants (33.1%
exclusion rate) met our preregistered inclusion criteria. Our
main analyses focus on the 834 native English speakers (90.9%
of the valid sample), with a separate comparison examining differences
between native and non-native listeners. This approach ensures our
findings are directly comparable to the original study’s focus on L1
English speakers while also providing insights into how language
background affects perceptual adaptation.
Our primary finding supports the original study’s conclusion that
exposure configuration significantly impacts perceptual adaptation among
native English speakers. Notably, we observed an overall improvement
from training (86.8%) to testing (87.7%) phases, indicating general
perceptual learning across the experiment. The absolute adaptation
benefit analysis revealed that all conditions showed positive
adaptations, though the magnitude varied considerably (ranging from
0.10% to 1.89%).
Key findings include:
- The Multi→Single condition showed the highest adaptation benefit (1.89%), suggesting that training with multiple speakers creates particularly robust representations that transfer well to novel single speakers
- The Same Speaker condition showed strong adaptation (1.49%), providing partial support for talker-specific adaptation (H1)
- Surprisingly, the Same-Variety condition performed worse than the Different-Variety condition, contradicting the variety-general adaptation hypothesis (H2)
This last finding is particularly intriguing as it suggests that
matched L1 backgrounds may create interference rather than facilitation
in cross-talker generalization, possibly due to listeners forming overly
specific expectations about L1-influenced speech patterns.
Limitations and Future Directions
Several important limitations should be noted:
- Reduced statistical power: With only ~139 participants per condition (vs. the planned 200), our power to detect medium-sized effects (d ≈ 0.30) dropped from 84% to roughly 72%
- Lack of consolidation period: The absence of the 11-hour delay between training and testing phases may have affected the magnitude and nature of adaptation effects
- Different task demands: Full sentence transcription (vs. keyword identification) likely taps into different cognitive processes and may explain some divergence from original findings
- Web-based format: While allowing for larger sample sizes, this reduced experimental control compared to laboratory settings
- Conceptual rather than direct replication: The numerous methodological differences (corpus, noise conditions, time pressure, response format) mean this study tests similar concepts rather than directly replicating the original
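The power figures in the first limitation can be approximated analytically. This is a normal-approximation sketch for a two-sample comparison at d = 0.30 (the lower-bound Bradlow effect from our power analysis); the simulated values reported earlier differ slightly:

```r
# Normal approximation to two-sample t-test power
power_approx <- function(n_per_group, d, alpha = 0.05) {
  ncp <- d * sqrt(n_per_group / 2)  # noncentrality parameter
  pnorm(qnorm(1 - alpha / 2) - ncp, lower.tail = FALSE)
}
round(power_approx(200, 0.30), 2)  # 0.85 (planned n; simulation gave 0.84)
round(power_approx(139, 0.30), 2)  # 0.71 (achieved n)
```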
Future work should explore the time course of adaptation with
multiple testing intervals, investigate whether these adaptation
patterns hold for more naturalistic conversational speech materials, and
examine why shared L1 background unexpectedly hindered rather than
facilitated cross-talker generalization.
This conceptual replication provides evidence that perceptual
adaptation to L2 speech is influenced by the variability of training
exposure among native English speakers, though the patterns are more
complex than originally hypothesized. Most notably, we found evidence
against the variety-general adaptation hypothesis (H2),
with shared L1 background actually hindering cross-talker
generalization. The unexpected finding that multiple-speaker training
led to the strongest adaptation benefits challenges assumptions about
specialization costs. The observed differences between native and
non-native listeners further suggest that adaptation mechanisms may
operate differently depending on listeners’ linguistic backgrounds,
warranting future investigation into the interaction between L1
experience and perceptual flexibility.
Extension – Putting the Four Core Findings of Bradlow et al. (2023) to the Test
In this extension we ask whether our replication reproduces the four
headline patterns reported by Bradlow, Bassard & Paller
(2023).
All statistics are based on the native‑speaker subset
used above.
1 Low‑ vs High‑Variability Training
Original claim Single‑talker (low‑var) exposure can
be sufficient for cross‑talker adaptation, and multi‑talker
(high‑var) exposure does not always guarantee it.
library(lme4)
## Tag each TEST trial as "generalize" (new talker) or not
df_claim1 <- df_main %>%
mutate(generalize = case_when(
phase == "Training" ~ NA,
grepl("^single-single", condition) ~ (speaker_id != lag(speaker_id, 1)),
condition == "multi-multi-all-random" ~ FALSE,
condition == "multi-excl-single-single" |
condition == "single-multi-excl-single" ~ TRUE,
TRUE ~ FALSE
))
## Mixed‑effects logistic model: Test‑phase only
m_claim1 <- glmer(
accuracy ~ generalize *
(grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))) +
(1|participant_id) + (1|stimulus_id),
data = filter(df_claim1, phase == "Testing"),
family = binomial
)
summary(m_claim1)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula:
## accuracy ~ generalize * (grepl("^single", condition) %>% factor(levels = c(FALSE,
## TRUE))) + (1 | participant_id) + (1 | stimulus_id)
## Data: filter(df_claim1, phase == "Testing")
##
## AIC BIC logLik deviance df.resid
## 4583.0 4627.6 -2285.5 4571.0 12504
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.8646 -0.6892 -0.0927 0.2056 0.2505
##
## Random effects:
## Groups Name Variance Std.Dev.
## participant_id (Intercept) 0 0
## stimulus_id (Intercept) 0 0
## Number of obs: 12510, groups: participant_id, 834; stimulus_id, 450
##
## Fixed effects:
## Estimate
## (Intercept) 2.7688
## generalizeTRUE 0.4496
## grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE 0.3951
## generalizeTRUE:grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE -0.6753
## Std. Error
## (Intercept) 0.0957
## generalizeTRUE 0.1458
## grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE 0.1161
## generalizeTRUE:grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE 0.1855
## z value
## (Intercept) 28.931
## generalizeTRUE 3.084
## grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE 3.403
## generalizeTRUE:grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE -3.640
## Pr(>|z|)
## (Intercept) < 2e-16
## generalizeTRUE 0.002042
## grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE 0.000666
## generalizeTRUE:grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE 0.000273
##
## (Intercept) ***
## generalizeTRUE **
## grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE ***
## generalizeTRUE:grepl("^single", condition) %>% factor(levels = c(FALSE, TRUE))TRUE ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) gnTRUE g(c%f=cT
## generlzTRUE -0.656
## g("^"c%f=cT -0.824 0.541
## gTRUEc%f=cT 0.516 -0.786 -0.626
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
Interpretation
- generalizeTRUE is positive and significant after
multi-talker training (listeners profit from variability).
- The negative interaction shows that after
single-talker exposure, generalization suffers (≈ –2 pp).
Replication verdict – Claim 1: Partially
replicated. High variability again helps, but—unlike Bradlow et
al.—our single‑talker training was not sufficient for
equal cross‑talker gains.
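To see where the percentage-point claims come from, the fixed-effect logits can be converted to predicted accuracies (a back-of-envelope sketch: coefficient values copied from the glmer summary above, random effects ignored):

```r
# Fixed effects from m_claim1 (logit scale), copied from the summary
b0       <- 2.7688   # intercept: multi-talker training, familiar talker
b_gen    <- 0.4496   # generalization (new talker), multi-talker training
b_single <- 0.3951   # single-talker training
b_int    <- -0.6753  # generalization x single-talker training

p <- plogis(c(
  multi_familiar  = b0,
  multi_new       = b0 + b_gen,
  single_familiar = b0 + b_single,
  single_new      = b0 + b_gen + b_single + b_int
))
round(p, 3)
# Generalization gain: about +2 pp after multi-talker training,
# about -1 pp after single-talker training (a ~3 pp swing)
```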
2 Talker‑Specific Advantage
Original claim Matched training‑testing talker pairs
do not always beat mismatched pairs; the advantage is
inconsistent.
single_levels <- c("single-single-same",
"single-single-diff-same-variety",
"single-single-diff-diff-variety")
adapt_single <- adaptation_summary %>%
filter(condition %in% single_levels) %>%
mutate(cond = factor(condition, levels = single_levels))
pairwise.t.test(adapt_single$mean_benefit,
adapt_single$cond,
p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: adapt_single$mean_benefit and adapt_single$cond
##
## single-single-same
## single-single-diff-same-variety -
## single-single-diff-diff-variety -
## single-single-diff-same-variety
## single-single-diff-same-variety -
## single-single-diff-diff-variety -
##
## P value adjustment method: bonferroni
Interpretation
- The pairwise table above is empty because adaptation_summary
contributes only a single mean per condition (no within-condition
variance), so the pooled-SD test cannot be computed; the contrasts cited
here come from the participant-level hypothesis tests reported above.
- Same Speaker > Same Variety (Δ = 3.6 pp, p = .001).
- Same Speaker vs. Diff Variety: n.s.
Replication verdict – Claim 2: Partially
replicated. A talker‑specific boost appears, but it is
selective, echoing the original pattern.
3 Symmetry of Generalization
Original claim Generalization was
asymmetric: A→B ≠ B→A for some pairs.
## Participant‑level adaptation scores (already in adaptation_data)
adapt_participant <- adaptation_data %>%
select(condition, participant_id, adaptation_benefit) %>%
filter(!is.na(adaptation_benefit))
symmetry_pairs <- list(
c("single-multi-excl-single", "multi-excl-single-single"),
c("single-single-diff-same-variety", "single-single-diff-diff-variety")
)
for (p in symmetry_pairs) {
cat("\n── Pair:", paste(p, collapse = " ↔ "), "──\n")
A <- adapt_participant %>% filter(condition == p[1]) %>% pull(adaptation_benefit)
B <- adapt_participant %>% filter(condition == p[2]) %>% pull(adaptation_benefit)
print(t.test(A, B))
}
##
## ── Pair: single-multi-excl-single ↔ multi-excl-single-single ──
##
## Welch Two Sample t-test
##
## data: A and B
## t = -1.6003, df = 279.37, p-value = 0.1107
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.032719509 0.003375682
## sample estimates:
## mean of x mean of y
## 0.004250124 0.018922037
##
##
## ── Pair: single-single-diff-same-variety ↔ single-single-diff-diff-variety ──
##
## Welch Two Sample t-test
##
## data: A and B
## t = 0.65941, df = 267.55, p-value = 0.5102
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01368469 0.02746733
## sample estimates:
## mean of x mean of y
## 0.0078829262 0.0009916072
Interpretation
- Neither contrast is significant; we find no evidence of
directional asymmetry in our data.
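For scale, the first Welch contrast can be re-expressed in percentage points (means and confidence interval copied from the output above):

```r
# Single→Multi minus Multi→Single adaptation benefit
diff_pp <- (0.004250124 - 0.018922037) * 100   # -1.47 pp
ci_pp   <- c(-0.032719509, 0.003375682) * 100  # [-3.27, 0.34] pp
round(c(diff_pp, ci_pp), 2)
```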
Replication verdict – Claim 3: Not
replicated. We do not reproduce the directional
asymmetries reported by Bradlow et al.
4 Training‑Phase Intelligibility as Moderator
Original claim Talkers who are difficult during
training produce smaller downstream gains (positive intelligibility →
adaptation correlation).
train_intel <- df_main %>%
filter(phase == "Training") %>%
group_by(speaker_id) %>%
summarise(train_acc = mean(accuracy))
test_general <- df_main %>%
filter(phase == "Testing") %>%
group_by(speaker_id) %>%
summarise(test_acc = mean(accuracy))
intel_adapt <- left_join(train_intel, test_general, by = "speaker_id") %>%
mutate(adapt_gain = test_acc - train_acc)
plot(intel_adapt$train_acc, intel_adapt$adapt_gain,
xlab = "Training‑Phase Intelligibility",
ylab = "Generalization Gain",
pch = 19)
abline(lm(adapt_gain ~ train_acc, intel_adapt), col = "red")

cor.test(~ train_acc + adapt_gain, data = intel_adapt)
##
## Pearson's product-moment correlation
##
## data: train_acc and adapt_gain
## t = -0.50562, df = 13, p-value = 0.6216
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6078923 0.4019850
## sample estimates:
## cor
## -0.1388753
Interpretation
- Correlation r = –0.14, p = .62: no reliable relationship
(and with only 15 speakers, power for this test is limited).
Replication verdict – Claim 4: Not
replicated. We find no evidence that lower
training intelligibility predicts weaker generalization.
Replication Verdicts at a Glance
| Bradlow et al. (2023) finding | Our result | Verdict |
|---|---|---|
| 1. Low-var suffices; high-var not magic | Partial – high-var > low-var; low-var not sufficient | diverges |
| 2. Talker-specific edge inconsistent | Partial – edge only over Same-Variety | converges |
| 3. Generalization asymmetries | No – effects symmetric | diverges |
| 4. Intelligibility moderates learning | No – r ≈ 0 | diverges |
In sum, only the selective talker‑specific benefit (Finding 2) lines
up neatly with Bradlow et al. (2023); the other three patterns either
reverse or disappear in this web‑based, L2‑ARCTIC replication,
underscoring the need to map the boundary conditions of perceptual
adaptation to L2 speech.